Automating the extraction of data from HTML tables with unknown structure

Authors

  • David W. Embley
  • Cui Tao
  • Stephen W. Liddle
Abstract

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of interest within a Web page, recognize attributes and values within the table, pair attributes with values, and form records. Data-integration techniques allow us to match source records with a target schema. Ontologically specified wrappers allow us to extract data from source records into a target schema. Experimental results show that we can successfully locate data of interest in tables and map the data from source HTML tables with unknown structure to a given target database schema. We can thus “directly” query source data with unknown structure through a known target schema.
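The pipeline described in the abstract (recognize attributes and values, pair them into records, then match the records to a target schema) can be illustrated with a small sketch. This is not the authors' system. It assumes the simplest possible layout, a row-oriented table whose first row holds the attribute labels, and it replaces the extraction ontology with a hypothetical `TARGET_SCHEMA` dictionary of label synonyms. The names `TableRows`, `TARGET_SCHEMA`, `extract_records`, and the car-advertisement sample data are all invented for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for an extraction ontology: each target attribute
# lists the source header labels that should be mapped to it.
TARGET_SCHEMA = {
    "make": {"make", "manufacturer"},
    "year": {"year", "yr"},
    "price": {"price", "asking price"},
}


def extract_records(html):
    """Map rows of a header-first HTML table onto TARGET_SCHEMA attributes."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows

    # Data integration step: match each source column to a target attribute.
    column_to_target = {}
    for index, label in enumerate(header):
        for target, labels in TARGET_SCHEMA.items():
            if label.lower() in labels:
                column_to_target[index] = target

    # Record formation step: pair each value with its matched attribute and
    # drop columns that have no counterpart in the target schema.
    return [
        {column_to_target[i]: value
         for i, value in enumerate(row) if i in column_to_target}
        for row in body
    ]


SAMPLE = """
<table>
  <tr><th>Yr</th><th>Manufacturer</th><th>Colour</th><th>Asking Price</th></tr>
  <tr><td>1999</td><td>Honda</td><td>Red</td><td>$4,500</td></tr>
  <tr><td>2002</td><td>Ford</td><td>Blue</td><td>$7,200</td></tr>
</table>
"""

records = extract_records(SAMPLE)
```

Once the records are in the target schema, they can be queried without knowing the source layout, for example `[r for r in records if r["make"] == "Honda"]`. Tables with unknown structure (column-oriented layouts, nested tables, missing headers) would need the table-understanding step the abstract describes, which this sketch does not attempt.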

Similar resources

Automating the Extraction of Data from HTML Tables with Unknown Structure

The authors propose a solution to the problem of web information extraction, which aims to extract relevant information from webpages. However, since this is a broad field, they have limited their work to information that is available in HTML tables found on the Web and relates to a specific domain of interest. As a running example in their paper, the authors use car advertisements. I suggest ...

Automatic Ontology-Based Knowledge Extraction from Web Documents vs. Automating the Extraction of Data from HTML Tables with Unknown Structure

In this report we compare the papers [AKM + 03] and [ETL03]. We show that the two proposed systems realize different goals with the same or similar underlying techniques. • Source data of interest [ETL03] takes web pages containing HTML tables of interest for a given application domain as the input whereas [AKM + 03] considers unstructured text from webpages for the knowledge extraction process. ...

Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. The solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to recognize attributes...

Information Extraction from HTML Pages and its Integration

We propose a method of transformation and integration of HTML tables into a common XML list structure. HTML tables tend to have diversified structures, and such integration will help us browse and compare all related information in separate HTML pages simultaneously. This paper focuses on tasks of information extraction from tables and data categorization. For this purpose, we applied three alg...

Mining Tables from Large Scale HTML Texts

Table is a very common presentation scheme, but few papers touch on table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to...

Journal title:
  • Data Knowl. Eng.

Volume 54  Issue 

Pages  -

Publication date 2005